Overview

In professional sports, there is a huge interest in attempting to leverage historic statistics to both predict future outcomes (wins/losses) and explore opportunities for tuning or improving a team or individual’s performance. This data-driven approach to sports has gained a large following over the last decade and entered mass media in the form of fantasy leagues, movies (e.g. Moneyball), and websites/podcasts (e.g. FiveThirtyEight). In this analysis, we will be using a classic baseball data set with the goal to build several different models capable of predicting team wins over a season given on other team stats during that season (i.e. homeruns, strikeouts, base hits, etc). The data set containing approximately 2200 records, representing professional baseball team statistics from the years 1871 to 2006 inclusive and each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

We will first explore the data looking for issues or challenges (i.e. missing data, outliers, possible coding errors, multicollinearlity, etc). Once we have a handle on the data, we will apply any necessary cleaning steps. Once we have a reasonable dataset to work with, we will build and evaluate three different linear models that predict seasonal wins. Our dataset includes both training data and evaluation data - we will training using the main training data, then evaluate models based on how well they perform against the holdout evaluation data. Finally we will select a final model that offers the best compromise for accuracy and simplicity.

1. Data Exploration

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Dataset

Columns

Descriptive text … Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce ut augue pharetra, luctus lectus ut, rutrum quam. Aenean quis tellus ac felis accumsan pellentesque id ut purus. Fusce eget ligula eu est congue aliquet. Vivamus hendrerit felis varius lorem suscipit venenatis. Fusce facilisis arcu ac lorem cursus, non pretium velit finibus. Suspendisse eu nulla tellus. Nunc viverra elementum dolor, ut scelerisque nisl. Ut iaculis faucibus ultricies. Praesent fermentum eu libero et consequat. Phasellus vitae euismod lectus, a ultrices dui. Nunc vel leo rhoncus, cursus elit quis, rhoncus nisl. Aenean id urna et nibh tempor iaculis nec non quam. In tincidunt luctus ex eget viverra.

Variables of Interest

Lets look at the first few rows to get a sense for the data.

INDEX TARGET_WINS BATTING_H BATTING_2B BATTING_3B BATTING_HR BATTING_BB BATTING_SO BASERUN_SB BASERUN_CS BATTING_HBP PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E FIELDING_DP
1 39 1445 194 39 13 143 842 NA NA NA 9364 84 927 5456 1011 NA
2 70 1339 219 22 190 685 1075 37 28 NA 1347 191 689 1082 193 155
3 86 1377 232 35 137 602 917 46 27 NA 1377 137 602 917 175 153
4 70 1387 209 38 96 451 922 43 30 NA 1396 97 454 928 164 156
5 82 1297 186 27 102 472 920 49 39 NA 1297 102 472 920 138 168

Note we will drop the INDEX column as it offers no information towards predictive models.

Dimensions

## [1] 2276   16

Summary Stats

Descriptive text … Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce ut augue pharetra, luctus lectus ut, rutrum quam. Aenean quis tellus ac felis accumsan pellentesque id ut purus. Fusce eget ligula eu est congue aliquet. Vivamus hendrerit felis varius lorem suscipit venenatis. Fusce facilisis arcu ac lorem cursus, non pretium velit finibus. Suspendisse eu nulla tellus. Nunc viverra elementum dolor, ut scelerisque nisl. Ut iaculis faucibus ultricies. Praesent fermentum eu libero et consequat. Phasellus vitae euismod lectus, a ultrices dui. Nunc vel leo rhoncus, cursus elit quis, rhoncus nisl. Aenean id urna et nibh tempor iaculis nec non quam. In tincidunt luctus ex eget viverra.

##   TARGET_WINS       BATTING_H      BATTING_2B      BATTING_3B    
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##    BATTING_HR       BATTING_BB      BATTING_SO       BASERUN_SB   
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##    BASERUN_CS     BATTING_HBP      PITCHING_H     PITCHING_HR   
##  Min.   :  0.0   Min.   :29.00   Min.   : 1137   Min.   :  0.0  
##  1st Qu.: 38.0   1st Qu.:50.50   1st Qu.: 1419   1st Qu.: 50.0  
##  Median : 49.0   Median :58.00   Median : 1518   Median :107.0  
##  Mean   : 52.8   Mean   :59.36   Mean   : 1779   Mean   :105.7  
##  3rd Qu.: 62.0   3rd Qu.:67.00   3rd Qu.: 1682   3rd Qu.:150.0  
##  Max.   :201.0   Max.   :95.00   Max.   :30132   Max.   :343.0  
##  NA's   :772     NA's   :2085                                   
##   PITCHING_BB      PITCHING_SO        FIELDING_E      FIELDING_DP   
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0  
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0  
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0  
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4  
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0  
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0  
##                   NA's   :102                        NA's   :286

Distributions

## Warning: Removed 3478 rows containing non-finite values (stat_bin).
## Warning: Removed 3478 rows containing non-finite values (stat_density).

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

Looking at both the histograms and boxplots, we notice that many of the features are NOT normally distributed (a concern for linear regressions), there are a significant number of outliers that we will have to consider, and we see a large number of zero values for some features. For zero values and outliers, we will want to qualitatively assess whether those make sense and whether we need to transform them in some way.

Variable Plots

featurePlot(df[,2:ncol(df)], df[,1], pch = 20)

Here we plot every variable against the target variable number of wins to get a feeling for which variables may be predictive of wins. The plots indicate some clear relationships, such as hitting more doubles or more home runs clearly improves the number of wins.

The plots also reveal significant issues with the data. First, there are many data points that contain missing data that will need to be either imputed or discarded. Second, it appears we have some missing data encoded as 0 and some nonsensical outliers.

There is a team with 0 wins in the dataset. This seems unlikely. Many of the hitting categories include teams at 0; it is unlikely that a team hit 0 home runs over the course of a season.

The pitching variables also include many 0’s, for instance there are multiple teams with 0 strikeouts by their pitchers over the season which is extremely unlikely. The pitching data also includes strange outliers such as a team loggin 20000 strikeouts, that would be an average of 160 strikeouts per game which is impossible. Also team pitching walks and team pitching hits have strange outliers.

Lastly, the error variable makes no sense to me. From my experience watching baseball, teams usually score 2 or less errors per game, which would lead to an overall team error of approximately 320 over the course of a season, which does not match the scale of the error variable.

Missing Data

When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.

##    values         ind
## 1   91.61 BATTING_HBP
## 2   33.92  BASERUN_CS
## 3   12.57 FIELDING_DP
## 4    5.76  BASERUN_SB
## 5    4.48  BATTING_SO
## 6    4.48 PITCHING_SO
## 7    0.00 TARGET_WINS
## 8    0.00   BATTING_H
## 9    0.00  BATTING_2B
## 10   0.00  BATTING_3B
## 11   0.00  BATTING_HR
## 12   0.00  BATTING_BB
## 13   0.00  PITCHING_H
## 14   0.00 PITCHING_HR
## 15   0.00 PITCHING_BB
## 16   0.00  FIELDING_E

Notice that ~91.6% of the rows are missing the BATTING_HBP field - we will just drop this column from onsideration. The columns BASERUN_CS (base run caught stealing) and BASERUN_SB (stolen bases) both have missing values. According to baseball history, stolen bases weren’t tracked officially until 1887, so some of the missing data could be from 1871-1886. We will impute those value. There are a high percentage of missing BATTING_SO (batter strike outs) and PITCHING_SO (pitching strike outs) which seem highly unlikely - we will also impute those missing values.

Feature-Target Correlations

##          values         ind
## 1   0.383313355   BATTING_H
## 2   0.285964582  BATTING_2B
## 3   0.225471523  BATTING_BB
## 4   0.186326615 PITCHING_HR
## 5   0.173551988  BATTING_HR
## 6   0.164738136 PITCHING_BB
## 7   0.139073830  BATTING_3B
## 8   0.121110012  BASERUN_SB
## 9   0.105034615  PITCHING_H
## 10  0.009835166  BASERUN_CS
## 11 -0.030050754 FIELDING_DP
## 12 -0.037712096  BATTING_SO
## 13 -0.083637577  FIELDING_E
## 14 -0.089519609 PITCHING_SO

When evaluating features for including in models, we will want to choose those with stronger positive or negaive correlations. Features with correlations closer to zero will probably not provide any meaning information on explaining wins by a team.

Pairplot

Multicolinearlity

When we start considering features for our models, we’ll need to account for the correlations betwwen features and avoid including pairs with storng correlations.

2. Data Preparation

Removed Fields

We removed the BATTING_HBP field as it was missing >90% of the data and the INDEX field as it offers no information for a model.

Missing Values

Missing values found in BASERUN_CS (Caught Stolen Bases), FIELDING_DP (Double plays), BASERUN_SB (Stolen Bases), BATTING_SO (Batter Strike outs), PITCHING_SO (Pitcher Strike outs) were all replaced with median values. It is highly unlikely that teams had none of these during an entire season.

Outliers

There are unreasonable outliers found in PITCHING_SO (pitching strikeouts), PITCHIN_H (allowed hits per game), PITCHING_BB (walks), and FIELDING_E (fielding errors) that exceed what is reasonable or possible given standard game length. While specific games might have outliers (e.g. in a game with extra innings), we wouldn’t expect the totals per season to allow for outliers in every game. Given this, we will replace any outliers with the median fro the data set. Limits we set included: > 4000 PITCHING_SO (25 striekouts per game), > 5000 PITCHING_H (30 hits allowed per game), > 2000 PITCHING_BB (13 walks per game) and > 480 FIELDING_E (3 errors per game).

Transform non-normal

3. Build Models

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.246  -8.040   0.149   8.459  64.706 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.8791382  5.5609672   3.575 0.000358 ***
## BATTING_H    0.0435068  0.0037271  11.673  < 2e-16 ***
## BATTING_2B  -0.0181980  0.0091059  -1.998 0.045785 *  
## BATTING_3B   0.0887800  0.0166055   5.346 9.87e-08 ***
## BATTING_HR  -0.0498772  0.0307896  -1.620 0.105385    
## BATTING_BB   0.0668013  0.0051197  13.048  < 2e-16 ***
## BATTING_SO   0.0043819  0.0039536   1.108 0.267841    
## BASERUN_SB   0.0199677  0.0041255   4.840 1.39e-06 ***
## BASERUN_CS   0.0054960  0.0154463   0.356 0.722013    
## PITCHING_H   0.0031034  0.0009168   3.385 0.000723 ***
## PITCHING_HR  0.0875913  0.0272043   3.220 0.001301 ** 
## PITCHING_BB -0.0343166  0.0043823  -7.831 7.39e-15 ***
## PITCHING_SO -0.0059484  0.0028921  -2.057 0.039826 *  
## FIELDING_E  -0.0431197  0.0051761  -8.331  < 2e-16 ***
## FIELDING_DP -0.1453152  0.0132805 -10.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 2260 degrees of freedom
## Multiple R-squared:  0.3141, Adjusted R-squared:  0.3099 
## F-statistic: 73.93 on 14 and 2260 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_3B + 
##     BATTING_BB + BASERUN_SB + PITCHING_H + PITCHING_HR + PITCHING_BB + 
##     PITCHING_SO + FIELDING_E + FIELDING_DP, data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.051  -7.965   0.164   8.359  65.549 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.7189057  4.9999664   4.344 1.46e-05 ***
## BATTING_H    0.0428227  0.0036457  11.746  < 2e-16 ***
## BATTING_2B  -0.0172282  0.0089438  -1.926 0.054197 .  
## BATTING_3B   0.0917609  0.0161863   5.669 1.62e-08 ***
## BATTING_BB   0.0643346  0.0041503  15.501  < 2e-16 ***
## BASERUN_SB   0.0209334  0.0039963   5.238 1.77e-07 ***
## PITCHING_H   0.0027985  0.0008467   3.305 0.000964 ***
## PITCHING_HR  0.0466418  0.0083051   5.616 2.19e-08 ***
## PITCHING_BB -0.0323388  0.0034282  -9.433  < 2e-16 ***
## PITCHING_SO -0.0031173  0.0016399  -1.901 0.057446 .  
## FIELDING_E  -0.0425346  0.0051451  -8.267 2.31e-16 ***
## FIELDING_DP -0.1466897  0.0132112 -11.103  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 2263 degrees of freedom
## Multiple R-squared:  0.3132, Adjusted R-squared:  0.3098 
## F-statistic:  93.8 on 11 and 2263 DF,  p-value: < 2.2e-16

We can use a simple decision tre to help visualize feature importance.

4. Select Models

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.

References

  • A
  • B

Appendix

A. Moneyball Dataset Columns

  • INDEX: Identification Variable(Do not use)
  • TARGET_WINS: Number of wins
  • TEAM_BATTING_H : Base Hits by batters (1B,2B,3B,HR)
  • TEAM_BATTING_2B: Doubles by batters (2B)
  • TEAM_BATTING_3B: Triples by batters (3B)
  • TEAM_BATTING_HR: Home runs by batters (4B)
  • TEAM_BATTING_BB: Walks by batters
  • TEAM_BATTING_HBP: Batters hit by pitch (get a free base)
  • TEAM_BATTING_SO: Strikeouts by batters
  • TEAM_BASERUN_SB: Stolen bases
  • TEAM_BASERUN_CS: Caught stealing
  • TEAM_FIELDING_E: Errors
  • TEAM_FIELDING_DP: Double Plays
  • TEAM_PITCHING_BB: Walks allowed
  • TEAM_PITCHING_H: Hits allowed
  • TEAM_PITCHING_HR: Homeruns allowed
  • TEAM_PITCHING_SO: Strikeouts by pitchers

R Code

# Show all code here